NSF PAR Search | NSF Public Access Repository

Ripple: Asynchronous Programming for Spatial Dataflow Architectures

https://doi.org/10.1145/3729256

Ghosh, Souradip; Shi, Yufei; Lucia, Brandon; Beckmann, Nathan (June 2025, Proceedings of the ACM on Programming Languages)

Spatial dataflow architectures (SDAs) are a promising and versatile accelerator platform. They are software-programmable and achieve near-ASIC performance and energy efficiency, beating CPUs by orders of magnitude. Unfortunately, many SDAs struggle to efficiently implement irregular computations because they suffer from an abstraction inversion: they fail to capture coarse-grain dataflow semantics in the application — namely asynchronous communication, pipelining, and queueing — that are naturally supported by the dataflow execution model and existing SDA hardware. Ripple is a language and architecture that corrects the abstraction inversion by preserving dataflow semantics down the stack. Ripple provides asynchronous iterators, shared-memory atomics, and a familiar task-parallel interface to concisely express the asynchronous pipeline parallelism enabled by an SDA. Ripple efficiently implements deadlock-free, asynchronous task communication by exposing hardware token queues in its ISA. Across nine important workloads, compared to a recent ordered-dataflow SDA, Ripple shrinks programs by 1.9×, improves performance by 3×, increases IPC by 58%, and reduces dynamic instructions by 44%.
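The coarse-grain dataflow semantics the abstract names (asynchronous communication, pipelining, and queueing) can be illustrated in ordinary software. The following is a minimal sketch using Python's asyncio, not Ripple's language or ISA: the stage functions, the `DONE` sentinel, and the queue depth are all assumptions made for illustration. Bounded queues stand in for hardware token queues, so a fast producer stalls when a downstream stage falls behind.

```python
import asyncio

DONE = object()  # hypothetical end-of-stream sentinel


async def produce(items, out_q):
    # First stage: push each item downstream, blocking when the queue is full.
    for item in items:
        await out_q.put(item)
    await out_q.put(DONE)


async def stage(fn, in_q, out_q):
    # Middle stage: consume a token, apply fn, forward the result.
    while True:
        item = await in_q.get()
        if item is DONE:
            await out_q.put(DONE)
            return
        await out_q.put(fn(item))


async def collect(in_q):
    # Last stage: drain tokens in arrival order.
    results = []
    while True:
        item = await in_q.get()
        if item is DONE:
            return results
        results.append(item)


async def run_pipeline(items, depth=2):
    # Bounded queues give backpressure between stages that run concurrently.
    q1 = asyncio.Queue(maxsize=depth)
    q2 = asyncio.Queue(maxsize=depth)
    _, _, results = await asyncio.gather(
        produce(items, q1),
        stage(lambda x: x * x, q1, q2),
        collect(q2),
    )
    return results


if __name__ == "__main__":
    print(asyncio.run(run_pipeline([1, 2, 3, 4])))
```

Each stage runs whenever its input queue holds a token, so the stages overlap in time without any global synchronization point.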
Full Text Available
Pipestitch: An energy-minimal dataflow architecture with lightweight threads

https://doi.org/10.1145/3613424.3614283

Serafin, Nathan; Ghosh, Souradip; Desai, Harsh; Beckmann, Nathan; Lucia, Brandon (October 2023, Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '23))

Computing at the extreme edge allows systems with high-resolution sensors to be pushed well outside the reach of traditional communication and power delivery, requiring high-performance, high-energy-efficiency architectures to run complex ML, DSP, image processing, etc. Recent work has demonstrated the suitability of CGRAs for energy-minimal computation, but has focused strictly on energy optimization, neglecting performance. Pipestitch is an energy-minimal CGRA architecture that adds lightweight hardware threads to ordered dataflow, exploiting abundant, untapped parallelism in the complex workloads needed to meet the demands of emerging sensing applications. Pipestitch introduces a programming model, control-flow operator, and synchronization network to allow lightweight hardware threads to pipeline on the CGRA fabric. Across 5 important sparse workloads, Pipestitch achieves a 3.49× increase in performance over RipTide, the state of the art, at a cost of a 1.10× increase in area and a 1.05× increase in energy.
RipTide: A Programmable, Energy-Minimal Dataflow Compiler and Architecture

https://doi.org/10.1109/MICRO56248.2022.00046

Gobieski, Graham; Ghosh, Souradip; Heule, Marijn; Mowry, Todd; Nowatzki, Tony; Beckmann, Nathan; Lucia, Brandon (October 2022, 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO))

Emerging sensing applications create an unprecedented need for energy efficiency in programmable processors. To achieve useful multi-year deployments on a small battery or energy harvester, these applications must avoid off-device communication and instead process most data locally. Recent work has shown coarse-grained reconfigurable arrays (CGRAs) to be a promising architecture for this domain. Unfortunately, nearly all prior CGRAs support only computations with simple control flow and no memory aliasing (e.g., affine inner loops), causing an Amdahl efficiency bottleneck as non-trivial fractions of programs must run on an inefficient von Neumann core. RipTide is a co-designed compiler and CGRA architecture that achieves both high programmability and extreme energy efficiency, eliminating this bottleneck. RipTide provides a rich set of control-flow operators that support arbitrary control flow and memory access on the CGRA fabric. RipTide implements these primitives without tagged tokens to save energy; this requires careful ordering analysis in the compiler to guarantee correctness. RipTide further saves energy and area by offloading most control operations into its programmable on-chip network, where they can re-use existing network switches. RipTide’s compiler is implemented in LLVM, and its hardware is synthesized in Intel 22FFL. RipTide compiles applications written in C while saving 25% energy vs. the state-of-the-art energy-minimal CGRA and 6.6× energy vs. a von Neumann core.
Full Text Available
WARio: efficient code generation for intermittent computing

https://doi.org/10.1145/3519939.3523454

Kortbeek, Vito; Ghosh, Souradip; Hester, Josiah; Campanoni, Simone; Pawełczak, Przemysław (June 2022, Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation)

Intermittently operating embedded computing platforms powered by energy harvesting require software frameworks to protect from errors caused by Write After Read (WAR) dependencies. A powerful method of code protection for systems with non-volatile main memory utilizes compiler analysis to insert a checkpoint inside each WAR violation in the code. However, such software frameworks are oblivious to the code structure, and therefore inefficient, when many consecutive WAR violations exist. Our insight is that by transforming the input code, i.e., moving individual write operations from unique WARs close to each other, we can significantly reduce the number of checkpoints. This idea is the foundation for WARio: a set of compiler transformations for efficient code generation for intermittent computing. WARio, on average, reduces checkpoint overhead by 58%, and up to 88%, compared to the state of the art across various benchmarks.
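The effect of clustering writes can be seen in a toy model. The sketch below is an illustration under stated assumptions, not WARio's actual analysis or transformations: a program is a flat list of `("r", location)` and `("w", location)` operations, a checkpoint is placed immediately before any write to a location that was read first since the last checkpoint, and the function name `count_checkpoints` is hypothetical.

```python
def count_checkpoints(ops):
    """Count checkpoints a greedy placer inserts to break every WAR hazard.

    ops is a list of (kind, location) pairs, kind being "r" or "w".
    This is a simplified model assumed for illustration only.
    """
    read_first = set()  # locations whose first access since the checkpoint was a read
    written = set()     # locations already written since the checkpoint
    checkpoints = 0
    for kind, loc in ops:
        if kind == "r":
            if loc not in written:
                read_first.add(loc)
        elif kind == "w":
            if loc in read_first:
                # Writing here would complete a WAR: checkpoint first,
                # which starts a fresh region with no pending reads.
                checkpoints += 1
                read_first.clear()
                written.clear()
            written.add(loc)
        else:
            raise ValueError(f"unknown operation kind: {kind!r}")
    return checkpoints


# Two independent WARs, interleaved: each write needs its own checkpoint.
interleaved = [("r", "a"), ("w", "a"), ("r", "b"), ("w", "b")]

# Same operations with the writes moved next to each other: one checkpoint
# placed before the first write covers both hazards.
clustered = [("r", "a"), ("r", "b"), ("w", "a"), ("w", "b")]

if __name__ == "__main__":
    print(count_checkpoints(interleaved), count_checkpoints(clustered))
```

The reordering is legal in this example because the read of `b` does not depend on the write to `a`. A real compiler would have to establish that independence before moving any operation.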
Full Text Available
FPVM: Towards a Floating Point Virtual Machine

https://doi.org/10.1145/3502181.3531469

Dinda, Peter; Wanninger, Nick; Ma, Jiacheng; Bernat, Alex; Bernat, Charles; Ghosh, Souradip; Kraemer, Christopher; Elmasry, Yehya (June 2022, Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2022))

Full Text Available
Compiler-Based Timing For Extremely Fine-Grain Preemptive Parallelism

https://doi.org/10.1109/SC41405.2020.00057

Ghosh, Souradip; Cuevas, Michael; Campanoni, Simone; Dinda, Peter (November 2020, Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC 2020))
In current operating system kernels and run-time systems, timing is based on hardware timer interrupts, introducing inherent overheads that limit granularity. For example, the scheduling quantum of preemptive threads is limited, resulting in this abstraction being restricted to coarse-grain parallelism. Compiler-based timing replaces interrupts from the hardware timer with callbacks from compiler-injected code. We describe a system that achieves low-overhead timing using whole-program compiler transformations and optimizations combined with kernel and run-time support. A key novelty is new static analyses that achieve predictable, periodic run-time behavior from the transformed code, regardless of control-flow path. We transform the code of a kernel and run-time system to use compiler-based timing and leverage the resulting fine-grain timing to extend an implementation of fibers (cooperatively scheduled threads), attaining what is effectively preemptive scheduling. The result combines the fine granularity of the cooperative fiber model with the ease of programming of the preemptive thread model.
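The idea of replacing timer interrupts with injected callbacks can be sketched for the simplest case, a single straight-line path. The example below is an assumption-laden illustration, not the paper's analysis: instructions are modeled only by an estimated cycle cost, `inject_checks` and the `CHECK` marker are hypothetical names, and the control-flow-aware static analyses the abstract describes are not modeled.

```python
CHECK = "CHECK"  # hypothetical marker for a compiler-injected timing callback


def inject_checks(costs, budget):
    """Insert CHECK markers so no run between checks exceeds `budget` cycles.

    costs is a list of estimated per-instruction cycle costs along one
    straight-line path. A single instruction costing more than the budget
    cannot be split, so it simply runs on its own.
    """
    out = []
    elapsed = 0  # estimated cycles since the last check (or the start)
    for cost in costs:
        if elapsed and elapsed + cost > budget:
            out.append(CHECK)
            elapsed = 0
        out.append(cost)
        elapsed += cost
    return out


if __name__ == "__main__":
    print(inject_checks([3, 4, 2, 5, 1], budget=8))
```

At run time each marker would become a cheap call into the scheduler, which decides whether the current fiber's quantum has expired. Handling branches and loops requires bounding the cost of every path between checks, which is the harder problem the abstract refers to.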
Full Text Available
CARAT CAKE: replacing paging via compiler/kernel cooperation

https://doi.org/10.1145/3503222.3507771

Suchy, Brian; Ghosh, Souradip; Kersnar, Drew; Chai, Siyuan; Huang, Zhen; Nelson, Aaron; Cuevas, Michael; Bernat, Alex; Chaudhary, Gaurav; Hardavellas, Nikos; et al (February 2022, Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems)

Virtual memory, specifically paging, is undergoing significant innovation due to being challenged by new demands from modern workloads. Recent work has demonstrated an alternative software-only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary-granularity memory management.
Full Text Available
NOELLE Offers Empowering LLVM Extensions

https://doi.org/10.1109/CGO53902.2022.9741276

Matni, Angelo; Deiana, Enrico Armenio; Su, Yian; Gross, Lukas; Ghosh, Souradip; Apostolakis, Sotiris; Xu, Ziyang; Tan, Zujun; Chaturvedi, Ishita; Homerding, Brian; et al (April 2022, 2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO))

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it.
Full Text Available
